


Creators/Authors contains: "Yooseph, Shibu"



  1. Abstract

    Metagenomics is the study of all genomic content contained in given microbial communities. Metagenomic functional analysis aims to quantify protein families and reconstruct metabolic pathways from the metagenome. It plays a central role in understanding the interaction between the microbial community and its host or environment. De novo functional analysis, which allows the discovery of novel protein families, remains challenging for high-complexity communities. There are currently three main approaches for recovering novel genes or proteins: de novo nucleotide assembly, gene calling and peptide assembly. Unfortunately, their information dependency has been overlooked, and each has been formulated as an independent problem. In this work, we develop a sophisticated workflow called integrated Metagenomic Protein Predictor (iMPP), which leverages the information dependencies for better de novo functional analysis. iMPP contains three novel modules: a hybrid assembly graph generation module, a graph-based gene calling module, and a peptide assembly-based refinement module. iMPP significantly improved the existing gene calling sensitivity on unassembled metagenomic reads, achieving a 92–97% recall rate at a high precision level (>85%). iMPP further allowed for more sensitive and accurate peptide assembly, recovering more reference proteins and delivering more hypothetical protein sequences. The high performance of iMPP can provide a more comprehensive and unbiased view of the microbial communities under investigation. iMPP is freely available from https://github.com/Sirisha-t/iMPP.

     
  2. Singh, Mona (Ed.)
    Microbial associations are characterized by both direct and indirect interactions between the constituent taxa in a microbial community, and play an important role in determining the structure, organization, and function of the community. Microbial associations can be represented using a weighted graph (microbial network) whose nodes represent taxa and edges represent pairwise associations. A microbial network is typically inferred from a sample-taxa matrix that is obtained by sequencing multiple biological samples and identifying the taxa counts in each sample. However, it is known that microbial associations are impacted by environmental and/or host factors. Thus, a sample-taxa matrix generated in a microbiome study involving a wide range of values for the environmental and/or clinical metadata variables may in fact be associated with more than one microbial network. Here we consider the problem of inferring multiple microbial networks from a given sample-taxa count matrix. Each sample is a count vector assumed to be generated by a mixture model consisting of component distributions that are Multivariate Poisson Log-Normal. We present a variational Expectation Maximization algorithm for the model selection problem to infer the correct number of components of this mixture model. Our approach involves reframing the mixture model as a latent variable model, treating only the mixing coefficients as parameters, and subsequently approximating the marginal likelihood using an evidence lower bound framework. Our algorithm is evaluated on a large simulated dataset generated using a collection of different graph structures (band, hub, cluster, random, and scale-free). 
  3. Qian, Wenfeng (Ed.)
    Despite the medical significance of vipers to humans and the important ecological roles they fill, few high-quality genomic resources exist for these snakes outside of a few genera of pitvipers. Here we sequence, assemble, and annotate the genome of Fea’s Viper (Azemiops feae). This taxon is distributed in East Asia and belongs to a monotypic subfamily, sister to the pitvipers. Sequencing yielded a 1.56 Gb assembly with a contig N50 of 1.59 Mb, 97.6% of the genome assembly in contigs >50 kb, and a BUSCO completeness of 92.4%. We found that A. feae venom is primarily composed of phospholipase A2 (PLA2) proteins expressed by genes that likely arose from lineage-specific PLA2 gene duplications. Additionally, we show that renin, an enzyme associated with blood pressure regulation in mammals and known from the venoms of two viper species including A. feae, is expressed in the venom gland at levels comparable to those of known toxins and is present in the venom proteome. The co-option of this gene as a toxin may be more widespread in viperids than currently known. To investigate the historical population demographics of A. feae, we performed coalescent-based analyses and determined that the effective population size has remained stable over the last 100 kyr. This suggests that Quaternary glacial cycles likely had minimal influence on the demographic history of A. feae. This newly assembled genome will be an important resource for studying the genomic basis of phenotypic evolution and understanding the diversification of venom toxin gene families.
  4. Abstract

    SAR86 is an abundant and ubiquitous heterotroph in the surface ocean that plays a central role in the function of marine ecosystems. We hypothesized that despite its ubiquity, different SAR86 subgroups may be endemic to specific ocean regions and functionally specialized for unique marine environments. However, the global biogeographical distributions of SAR86 genes, and the manner in which these distributions correlate with marine environments, have not been investigated. We quantified SAR86 gene content across globally distributed metagenomic samples and modeled these gene distributions as a function of 51 environmental variables. We identified five distinct clusters of genes within the SAR86 pangenome, each with a unique geographic distribution associated with specific environmental characteristics. Gene clusters are characterized by the strong taxonomic enrichment of distinct SAR86 genomes and partial assemblies, as well as differential enrichment of certain functional groups, suggesting differing functional and ecological roles of SAR86 ecotypes. We then leveraged our models and high-resolution, remote sensing-derived environmental data to predict the distributions of SAR86 gene clusters across the world’s oceans, creating global maps of SAR86 ecotype distributions. Our results reveal that SAR86 exhibits previously unknown, complex biogeography, and provide a framework for exploring geographic distributions of genetic diversity from other microbial clades.

     
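The mixture of Multivariate Poisson Log-Normal components described in record 2 can be written out as a small generative sketch. The notation is assumed here for illustration and is not taken from the paper: $K$ components, $p$ taxa, a component label $c_i$ for sample $i$, a latent Gaussian vector $\mathbf{z}_i$, and observed counts $y_{ij}$. The published model may include additional terms (for example, sequencing-depth offsets) that are not shown. The second display is the generic form of an evidence lower bound with only the mixing coefficients $\boldsymbol{\pi}$ kept as parameters, with $\mathbf{H}$ standing for all latent variables and $q$ for a variational distribution.

```latex
% Assumed notation, for illustration only (not the paper's own symbols).
\begin{align*}
c_i &\sim \mathrm{Categorical}(\pi_1, \dots, \pi_K), \\
\mathbf{z}_i \mid c_i = k &\sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \\
y_{ij} \mid z_{ij} &\sim \mathrm{Poisson}\!\left(e^{z_{ij}}\right), \qquad j = 1, \dots, p.
\end{align*}

% Generic evidence lower bound on the marginal likelihood:
\[
\log p(\mathbf{Y} \mid \boldsymbol{\pi})
\;\ge\;
\mathbb{E}_{q(\mathbf{H})}\!\left[\log p(\mathbf{Y}, \mathbf{H} \mid \boldsymbol{\pi})\right]
- \mathbb{E}_{q(\mathbf{H})}\!\left[\log q(\mathbf{H})\right].
\]
```

In a formulation of this kind, each component's covariance (or its inverse) encodes one microbial network, so choosing the number of components amounts to choosing how many networks the sample-taxa matrix supports.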